Abstract: In adaptive dynamic programming, neurocontrol
and reinforcement learning, the objective is for an agent to learn 
to choose actions so as to minimise a total cost function. In this 
paper we show that when discretized time is used to model the 
motion of the agent, it can be very important to do "clipping" 
on the motion of the agent in the final time step of the trajectory. 
By clipping we mean that the final time step of the trajectory 
is to be truncated such that the agent stops exactly at the first 
terminal state reached, and no distance further. We demonstrate 
that when clipping is omitted, learning performance can fail to 
reach the optimum; and when clipping is done properly, learning 
performance can improve significantly. 

The clipping problem we describe affects algorithms which 
use explicit derivatives of the model functions of the environment 
to calculate a learning gradient. These include Backpropagation 
Through Time for Control, and methods based on Dual Heuristic 
Dynamic Programming. However the clipping problem does 
not significantly affect methods based on Heuristic Dynamic 
Programming, Temporal Differences or Policy Gradient Learning 
algorithms. Similarly, the clipping problem does not affect fixed- 
length finite-horizon problems. 

I. Introduction 

In Adaptive Dynamic Programming (ADP) [1], Neurocontrol [2],
and Reinforcement Learning (RL) [3], an agent moves
in a state space $\mathbb{S} \subset \mathbb{R}^n$, such that at integer time step $t$, it has
state vector $\vec{x}_t \in \mathbb{S}$. $\mathbb{T}$ is a fixed set of terminal states, with
$\mathbb{T} \subset \mathbb{S}$. At each time $t$, the agent chooses an action $\vec{u}_t$ which
takes it to the next state according to the environment's model
function

$$\vec{x}_{t+1} = f(\vec{x}_t, \vec{u}_t), \tag{1}$$

thus the agent passes through a trajectory of states
$(\vec{x}_0, \vec{x}_1, \vec{x}_2, \ldots)$, terminating only when (and if) a terminal
state is reached, as illustrated in Fig. 1. As shown in this
figure, clipping is the concept of calculating the exact fraction 
in the final time step at which a boundary of terminal states 
is reached, and stopping the agent exactly at this boundary. 
The name clipping is taken by analogy to the concept in 
computer graphics. Without clipping, the discretization of time 
would cause the agent to penetrate slightly beyond the terminal 
boundary, as shown in the figure. 

On transitioning from each state $\vec{x}_t$ to the next, the agent
receives an immediate scalar cost $U_t$ from the environment
according to the function

$$U_t := U(\vec{x}_t, \vec{u}_t). \tag{2}$$

Fig. 1: A trajectory reaching a terminal state. The thick curved 
line indicates a boundary of terminal states. In this diagram, 
clipping does not take place, and the trajectory penetrates 
beyond the terminal boundary. When clipping is used correctly, 
we intend to stop the agent exactly at the point of intersection 
between the trajectory and the terminal boundary. 

In addition, if the agent reaches a terminal state $\vec{x} \in \mathbb{T}$, then an
additional terminal cost is given by the scalar function $\Phi(\vec{x})$.

Throughout this paper, subscripts on variables will be used 
to indicate the time step of a trajectory. And from now on in 
the paper, we will only consider episodic, or finite horizon, 
environments; that is environments where all trajectories are 
guaranteed to meet a terminal state eventually. 

The ADP problem is for the agent to learn to choose
actions so as to minimise the expectation of the total long-term
cost received from any given start state $\vec{x}_0$. Specifically, the
problem is to find an action network $A(\vec{x}, \vec{z})$, where $\vec{z}$ is the
parameter vector of a function approximator, which calculates
an action

$$\vec{u}_t = A(\vec{x}_t, \vec{z}) \tag{3}$$

to take for any given state $\vec{x}_t$, such that the following
long-term cost is minimised:

$$J(\vec{x}_0, \vec{z}) := \left\langle \sum_{t=0}^{T-1} \gamma^t U_t + \gamma^T \Phi(\vec{x}_T) \right\rangle \tag{4}$$

subject to (1), (2) and (3), where $T$ is the time step at which
the first terminal state is reached (which in general will be
dependent on $\vec{x}_0$ and $\vec{z}$), where $\gamma \in [0, 1]$ is a constant discount
factor that specifies the relative importance of long-term costs
over short-term ones, and where $\langle \cdot \rangle$ denotes expectation.

The function $J(\vec{x}_0, \vec{z})$ is called the cost-to-go function from
state $\vec{x}_0$, or the value function.
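As a rough illustration of equation (4) for a single deterministic trajectory (so the expectation is dropped), the cost can be accumulated as in the following Python sketch. The function name `trajectory_cost` and all of its callback parameters are hypothetical names chosen for this example, and no clipping is applied here, so the final state may overshoot the terminal boundary.

```python
def trajectory_cost(x0, policy, model, cost, terminal_cost, is_terminal,
                    gamma=1.0, max_steps=10_000):
    """Discounted cost of one trajectory, following eq. (4) without clipping.

    All arguments are illustrative callbacks:
      policy(x) -> u, model(x, u) -> next x, cost(x, u) -> U,
      terminal_cost(x) -> Phi(x), is_terminal(x) -> bool.
    """
    x, total, discount = x0, 0.0, 1.0      # discount holds gamma**t
    for _ in range(max_steps):
        if is_terminal(x):
            # Add the terminal impulse gamma**T * Phi(x_T) and stop.
            return total + discount * terminal_cost(x)
        u = policy(x)
        total += discount * cost(x, u)     # gamma**t * U_t
        x = model(x, u)
        discount *= gamma
    raise RuntimeError("no terminal state reached within max_steps")
```

For example, with a scalar state starting at 0, a constant action of 0.4 per step and a terminal region `x >= 1`, the unclipped trajectory stops at roughly 1.2 rather than at the boundary value 1, which is the overshoot that clipping is meant to remove.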

In this paper we show that when a large final impulse of
cost $\Phi(\vec{x})$ is given at a terminal state $\vec{x} \in \mathbb{T}$, then failure
to do clipping in the final time step of the trajectory can
very significantly distort the direction of the learning gradient
used by certain ADP algorithms, and thus prevent successful
solution of the ADP problem. We also show that this problem
is not lessened by sampling the time steps of the underlying
continuous-time process at a higher rate. This problem affects
the commonly used ADP algorithms Dual Heuristic Dynamic
Programming (DHP) [4], and Backpropagation Through
Time (BPTT) [6], both of which are described in Section
II, plus algorithms based on DHP such as Value-Gradient
Learning [7], [8], [9]. These algorithms are all very closely
related to each other [10], [11], and for purposes of explaining
clipping as clearly as possible, we will use BPTT as the
example.

BPTT works by calculating the quantity $\partial J / \partial \vec{z}$ directly and
very efficiently for each trajectory sampled, enabling gradient
descent to be performed on $J$ with respect to $\vec{z}$. However,
without clipping being done correctly, the gradient that BPTT
calculates can be distorted enough to prevent learning. Fig. 2
illustrates the problems that arise without clipping.
(a) Spurious zigzag gradients can occur when clipping is not used.

(b) The graph of $R$ versus $\theta$ yields no useful local gradient
information. Hence minimising $R$ with respect to $\theta$ using only
$\partial R / \partial \theta$ would be impossible.

Fig. 2: An example of the problems that can occur when
clipping is not used.



In Fig. 2a the agent starts at O and travels in a straight line
at a constant speed, along a fixed chosen initial angle, $\theta$. The
straight line AB is a terminal boundary (i.e. a continuous line
of states in $\mathbb{T}$). The dotted arcs represent the integer time steps
that the agent passes through. If clipping is not used then the
agent will stop on the first integer time step (i.e. on the first
dotted arc) after passing the terminal boundary. This means
the agent will finally stop at a point somewhere on the bold
zigzag path from A to B. In Fig. 2b we see how the distance the
agent travelled before stopping ($R$) varies with $\theta$. If the
cost-to-go function $J$ was defined to be the total distance travelled
before termination (i.e. if $J := R$), and the parameter vector
of $J$ was defined to be $\theta$, then the ADP objective would be to
minimise $R$ with respect to $\theta$. But Fig. 2b shows that there is
no useful gradient information for learning, since
$\partial J / \partial \theta = \partial R / \partial \theta = 0$
whenever it exists, and hence gradient descent on $J$ with
respect to $\theta$ would fail without clipping.
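The flat, stepped graph described for Fig. 2b can be reproduced numerically. The sketch below is only an illustration under assumed geometry: the terminal boundary is taken to be the vertical line `x = boundary_x`, the agent starts at the origin, and the names `distance_travelled` and `central_difference` are hypothetical. It compares the distance $R$ with and without clipping.

```python
import math

def distance_travelled(theta, boundary_x=10.0, speed=1.0, clip=False):
    """Distance R travelled from the origin before stopping.

    Assumes a straight terminal boundary at x = boundary_x and an angle
    theta strictly between -pi/2 and pi/2 (measured from the x-axis).
    """
    vx = speed * math.cos(theta)        # progress towards the boundary per step
    exact_steps = boundary_x / vx       # fractional number of steps to the boundary
    if clip:
        return speed * exact_steps      # stop exactly on the boundary
    # Without clipping the agent stops on the first whole step at or past it.
    return speed * math.ceil(exact_steps)

def central_difference(fn, theta, eps=1e-6):
    """Numerical derivative of fn at theta."""
    return (fn(theta + eps) - fn(theta - eps)) / (2 * eps)
```

Under these assumptions the unclipped $R$ is piecewise constant in $\theta$, so its numerical derivative is zero almost everywhere, whereas the clipped $R = \text{boundary\_x} / \cos\theta$ has a usable slope.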

Situations can get even worse than this: In Fig. 3 we show a
pathological example where the gradient of the graph is always
in the opposite direction of the global minimum of $R$. This
could occur for example if we were trying to minimise the
function $J := R + y$ with respect to $\theta$, for the situation in Fig.
2a, where $y$ is the final y-coordinate of the agent, and $R$ is
the distance travelled before stopping.
Fig. 3: A pathological example: Local gradient is opposite to
global gradient.

In general, increasing the sampling rate of the discretization
of time will not solve the problem, since that would simply
make the dotted arcs in Fig. 2a squeeze closer together, and
will make the teeth of the saw-tooth blade shape in Fig. 3 finer.
The gradients in Figs. 2b and 3 would still not be helpful for
learning.

We show how to solve the problem by incorporating clipping
into the model and cost functions, $f(\vec{x}, \vec{u})$ and $U(\vec{x}, \vec{u})$,
when terminal states are reached. BPTT and DHP make
intensive use of the derivatives of these two functions, and
hence we must carefully differentiate through the clipped
versions of these functions. This is the important step that
we derive in this paper, and this step corrects the gradient
$\partial J / \partial \vec{z}$ to make it suitable for learning, and solves the problems
explained by Fig. 2 and Fig. 3.

As well as terminal boundaries in state space that deliver 
impulses of cost, similar corrections would need making in 
environments where the model and cost functions change 
their behaviour discontinuously as the agent traverses a given 
continuous boundary in state space. These boundaries would 
act as refraction layers do to photons. As the agent crosses 
them, the learning gradient would get twisted. The solution 
to this problem is similar to the one we propose for terminal 
boundaries, but we do not consider these non-terminal refrac- 
tion layers any more in this paper. 

The necessity for clipping affects any algorithm that calculates
the derivatives of the model function, i.e. $\partial f / \partial \vec{x}$, directly,
when terminal states that deliver impulses of cost are present.
For example the RL method of [12], which uses a
continuous-time numerical differentiation to evaluate its derivatives, will
also be affected by this clipping problem. Likewise, the ADP
methods of BPTT, DHP, GDHP [13] and Value-Gradient
Learning are also affected by the requirement for clipping.

Clipping is not necessary for any problem where the
termination condition is simply that a fixed integer number
of time steps is reached, as we discuss further in Section
III-D. Also our experiments in this paper show that the ADP
algorithm called Heuristic Dynamic Programming (HDP, [4], [1],
[5]) does not need clipping, since this algorithm does not make
significant use of the derivatives of the model function. The
policy-gradient learning methods of [14], [15] do not require
clipping either, since they do not use the derivatives of the
model function.

In the rest of this paper, in Section II we describe the
affected ADP algorithms for control problems. In Section III we
describe how to do the clipping and differentiate through the
modified model functions, as is required for effective gradient
descent. In Section IV we give experimental details of
neural-network control problems both with and without clipping. One
of these problems is the classic cart-pole benchmark problem
which we formulate in a way that would be impossible for
DHP to solve without clipping, and we show that the clipping
methods enable us to solve this problem efficiently. In Section
V we give conclusions.

II. The ADP/RL Learning Algorithms 

We describe three main ADP/RL algorithms first in their 
forms without clipping. 

A. Backpropagation Through Time For Control 

Backpropagation through time (BPTT) can be applied to 
control problems, as described by [6]. In this section we derive
and describe the algorithm. This is an algorithm that requires 
clipping in the environments we consider in this paper. 

BPTT is an efficient algorithm to calculate $\partial J / \partial \vec{z}$ for a given
trajectory. The combination of the BPTT gradient calculation
with a gradient descent weight update can be used to solve
control problems, i.e. by the weight update
$\Delta \vec{z} = -\alpha \frac{\partial J}{\partial \vec{z}}$ for
some small positive learning rate $\alpha$.

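Throughout this paper we make a notational convention that
all vectors are columns, and differentiation of a scalar by a
vector gives a column vector (e.g. $\partial J / \partial \vec{x}$ is a column). We define
differentiation of a vector function by a vector argument as the
transpose of the usual Jacobian notation. For example,
$\frac{\partial A(\vec{x}, \vec{z})}{\partial \vec{z}}$ is a matrix with element $(i, j)$ equal to
$\frac{\partial A^j}{\partial z^i}$. Similarly, $\frac{\partial f}{\partial \vec{x}}$ is
the matrix with element $(i, j)$ equal to $\frac{\partial f^j}{\partial x^i}$.

Parentheses subscripted with a "t" are what we call
trajectory-shorthand notation, which we define to indicate
that a quantity is evaluated at time step $t$ of a trajectory.
For example, $\left(\frac{\partial U}{\partial \vec{u}}\right)_t$ is the function $\frac{\partial U}{\partial \vec{u}}$
evaluated at $(\vec{x}_t, \vec{u}_t)$. Similarly,
$\left(\frac{\partial J}{\partial \vec{x}}\right)_{t+1} := \left. \frac{\partial J(\vec{x}, \vec{z})}{\partial \vec{x}} \right|_{(\vec{x}_{t+1}, \vec{z})}$
and
$\left(\frac{\partial A}{\partial \vec{z}}\right)_t := \left. \frac{\partial A(\vec{x}, \vec{z})}{\partial \vec{z}} \right|_{(\vec{x}_t, \vec{z})}$.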

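For any given trajectory starting at state $\vec{x}_0$, the function
$J(\vec{x}_0, \vec{z})$ given by (4) can be written recursively using
equations (1)-(3), as:

$$J(\vec{x}, \vec{z}) := U(\vec{x}, A(\vec{x}, \vec{z})) + \gamma J(f(\vec{x}, A(\vec{x}, \vec{z})), \vec{z}) \tag{5}$$

with $J(\vec{x}_T, \vec{z}) := \Phi(\vec{x}_T)$ at the trajectory's terminal state,
$\vec{x}_T \in \mathbb{T}$.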

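Differentiating (5) with the chain rule gives:

$$\begin{aligned}
\left(\frac{\partial J}{\partial \vec{z}}\right)_t
&= \frac{\partial}{\partial \vec{z}} \Big( U(\vec{x}, A(\vec{x}, \vec{z})) + \gamma J(f(\vec{x}, A(\vec{x}, \vec{z})), \vec{z}) \Big) && \text{by (5)} \\
&= \left(\frac{\partial A}{\partial \vec{z}}\right)_t \left( \left(\frac{\partial U}{\partial \vec{u}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{u}}\right)_t \left(\frac{\partial J}{\partial \vec{x}}\right)_{t+1} \right) + \gamma \left(\frac{\partial J}{\partial \vec{z}}\right)_{t+1}
\end{aligned}$$

where we used the chain rule, equations (1)-(3) and
trajectory-shorthand notation. In this equation there are implied
matrix-vector products that make use of the matrix notation defined
above.

Expanding this recursion gives:

$$\left(\frac{\partial J}{\partial \vec{z}}\right)_0 = \sum_{t=0}^{T-1} \gamma^t \left(\frac{\partial A}{\partial \vec{z}}\right)_t \left( \left(\frac{\partial U}{\partial \vec{u}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{u}}\right)_t \left(\frac{\partial J}{\partial \vec{x}}\right)_{t+1} \right) \tag{6}$$

This equation refers to the quantity $\partial J / \partial \vec{x}$, which can be found
recursively by differentiating (5) and using the chain rule,
giving

$$\left(\frac{\partial J}{\partial \vec{x}}\right)_t = \left(\frac{\partial U}{\partial \vec{x}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{x}}\right)_t \left(\frac{\partial J}{\partial \vec{x}}\right)_{t+1} + \left(\frac{\partial A}{\partial \vec{x}}\right)_t \left( \left(\frac{\partial U}{\partial \vec{u}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{u}}\right)_t \left(\frac{\partial J}{\partial \vec{x}}\right)_{t+1} \right) \tag{7}$$

with

$$\left(\frac{\partial J}{\partial \vec{x}}\right)_T = \left(\frac{\partial \Phi}{\partial \vec{x}}\right)_T \tag{8}$$

at the terminal state, $\vec{x}_T \in \mathbb{T}$.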

Equation (7) can be understood to be backpropagating the
quantity $\left(\frac{\partial J}{\partial \vec{x}}\right)_{t+1}$ through the action network, model and cost
functions to obtain $\left(\frac{\partial J}{\partial \vec{x}}\right)_t$, which gives the algorithm its name.
Pseudocode for the whole BPTT algorithm is given in Alg. 1,
where lines 2, 6 and 7 of the algorithm come from equations
(8), (6) and (7) respectively. In the algorithm, the vector
$\vec{p}$ holds the backpropagated value for $\partial J / \partial \vec{x}$. $Q_x$ and $Q_u$ are
the derivatives of the Q-function with respect to $\vec{x}$ and $\vec{u}$
respectively, where the Q-function is defined by

$$Q(\vec{x}, \vec{u}, \vec{z}) := U(\vec{x}, \vec{u}) + \gamma J(f(\vec{x}, \vec{u}), \vec{z})$$

The Q-function is a model-based version of the Q-function
defined in Q-learning [16]. It is similar to the cost-to-go
function's recursive definition (5), but it differs in that it
allows the first action chosen to be independent of the action
network. This will be useful in deriving the clipping equations
in Section III, but for now $Q_x$ and $Q_u$ can just be treated as
internal variables in Alg. 1. The BPTT algorithm runs in time
$O(\dim(\vec{z}))$ per trajectory step.

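Algorithm 1 Backpropagation Through Time for Control.
Require: Trajectory calculated by (1) and (3).
1: $\frac{\partial J}{\partial \vec{z}} \leftarrow \vec{0}$
2: $\vec{p} \leftarrow \left(\frac{\partial \Phi}{\partial \vec{x}}\right)_T$
3: for $t = T - 1$ to $0$ step $-1$ do
4:   $Q_x \leftarrow \left(\frac{\partial U}{\partial \vec{x}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{x}}\right)_t \vec{p}$
5:   $Q_u \leftarrow \left(\frac{\partial U}{\partial \vec{u}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{u}}\right)_t \vec{p}$
6:   $\frac{\partial J}{\partial \vec{z}} \leftarrow \frac{\partial J}{\partial \vec{z}} + \gamma^t \left(\frac{\partial A}{\partial \vec{z}}\right)_t Q_u$
7:   $\vec{p} \leftarrow Q_x + \left(\frac{\partial A}{\partial \vec{x}}\right)_t Q_u$
8: end for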
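The backward recursion of equations (6)-(8) can be sketched in Python on a deliberately tiny problem. Everything in this example is an assumption made for illustration and is not one of the paper's experiments: the state, action and weight are scalars, the model is $f(x, u) = x + u$, the cost is $U(x, u) = x^2 + u^2$, the terminal cost is $\Phi(x) = x^2$, the policy is $A(x, z) = z x$, and the trajectory has a fixed length so that no clipping is needed.

```python
def rollout(x0, z, steps):
    """Unroll the toy trajectory x_{t+1} = x_t + u_t with u_t = z * x_t."""
    xs, us = [x0], []
    for _ in range(steps):
        u = z * xs[-1]
        us.append(u)
        xs.append(xs[-1] + u)
    return xs, us

def total_cost(x0, z, steps, gamma):
    """Eq. (4) for the toy problem (deterministic, fixed length)."""
    xs, us = rollout(x0, z, steps)
    running = sum(gamma ** t * (xs[t] ** 2 + us[t] ** 2) for t in range(steps))
    return running + gamma ** steps * xs[steps] ** 2

def bptt_gradient(x0, z, steps, gamma):
    """dJ/dz by backpropagation through time, following eqs. (6)-(8)."""
    xs, us = rollout(x0, z, steps)
    dJ_dz = 0.0
    p = 2.0 * xs[steps]                        # eq. (8): dPhi/dx at x_T
    for t in reversed(range(steps)):
        q_x = 2.0 * xs[t] + gamma * 1.0 * p    # dU/dx + gamma * df/dx * p
        q_u = 2.0 * us[t] + gamma * 1.0 * p    # dU/du + gamma * df/du * p
        dJ_dz += gamma ** t * xs[t] * q_u      # eq. (6), with dA/dz = x_t
        p = q_x + z * q_u                      # eq. (7), with dA/dx = z
    return dJ_dz
```

The result of `bptt_gradient` can be compared against a central-difference estimate built from `total_cost`, which is the same style of check that the paper later describes for its own implementations.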
B. Dual Heuristic Dynamic Programming (DHP) and Heuristic
Dynamic Programming (HDP)

Dual Heuristic Dynamic Programming (DHP) and Heuristic
Dynamic Programming (HDP) are ADP algorithms which use
a critic function, and can require clipping in the environments
we consider in this paper. Both of these algorithms were
originally by Werbos [4] and are described more recently by
0, ifTTl . (3, and we define them briefly here. 

The use of critic functions allows these two algorithms
to apply their learning rule on-line, unlike the previously
described BPTT which needed to wait until a trajectory was
completed before it could apply the learning weight update.
DHP makes use of a vector-critic function $G(\vec{x}, \vec{w})$ which
produces a vector output of dimension $\dim(\vec{x})$. This could
be the output of a neural network with weight vector $\vec{w}$ and
$\dim(\vec{x})$ inputs and outputs. The DHP weight update attempts
to make the function $G(\vec{x}, \vec{w})$ learn to output the gradient $\partial J / \partial \vec{x}$.
HDP uses a scalar-critic function $V(\vec{x}, \vec{w})$ which produces a
scalar output. This could be the output of a neural network
with weight vector $\vec{w}$ and $\dim(\vec{x})$ inputs, and just one output
node. The HDP weight update attempts to make the function
$V(\vec{x}, \vec{w})$ learn to output the function $J(\vec{x}, \vec{z})$ for all $\vec{x} \in \mathbb{S}$.
HDP is equivalent to the algorithm "TD(0)" from the RL
literature [18].

Pseudocode for DHP is given in Alg. 2. Line 9 of the
algorithm trains the critic with a learning rate $\beta > 0$, and
line 10 implements a commonly used actor weight update
described by [5] (using a learning rate $\alpha > 0$). The algorithm
uses the same matrix notation for Jacobians and
trajectory-shorthand notation as described in Section II-A, so that for
example $\left(\frac{\partial G}{\partial \vec{w}}\right)_t$ is the function $\frac{\partial G}{\partial \vec{w}}$ evaluated at $(\vec{x}_t, \vec{w})$.

Pseudocode for HDP is given in Alg. 3. Lines 8 and 9
give the critic and action-network weight updates, respectively.
Again the action-network weight update is the one described
by [5], but model-free alternatives which don't require
knowledge of the derivatives of $f$ are also possible (e.g. [3, ch. 6.6],
or [19, sec. 4.2]).

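Backpropagation ([20], [21]) can be used to efficiently
calculate $\frac{\partial V}{\partial \vec{w}}$, $\frac{\partial V}{\partial \vec{x}}$, and the products involving
$\frac{\partial A}{\partial \vec{z}}$ and $\frac{\partial G}{\partial \vec{w}}$.
Using this method, both DHP and HDP can be implemented
in a running time of $O(n)$ operations per time step of the
trajectory, where $n = \max(\dim(\vec{w}), \dim(\vec{z}))$.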

The pseudocode gives explicit details of how the function
$\Phi$ is to be used instead of the critic at the final time step
of a trajectory. This is an important detail that is necessary to
implement clipping and finite-horizon problems correctly.

III. Using and Differentiating Clipping in 
Learning 

In this section we derive the formulae for the clipped model
and cost functions, and their derivatives. We will denote the
clipped versions of the original functions with a superscripted
C, so that $f^C$, $U^C$ and $J^C$ will be the function names we
use for the clipped versions of the model, cost and
cost-to-go functions, respectively. The functions $f^C$ and $U^C$ are
only defined for any state $\vec{x}_t$ that occurs immediately before a


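Algorithm 2 DHP with a critic network $G(\vec{x}, \vec{w})$ and
action network $A(\vec{x}, \vec{z})$.

1: $t \leftarrow 0$
2: while $\vec{x}_t \notin \mathbb{T}$ do
3:   $\vec{u}_t \leftarrow A(\vec{x}_t, \vec{z})$
4:   $\vec{x}_{t+1} \leftarrow f(\vec{x}_t, \vec{u}_t)$
5:   $\vec{p} \leftarrow \left(\frac{\partial \Phi}{\partial \vec{x}}\right)_{t+1}$ if $\vec{x}_{t+1} \in \mathbb{T}$; $\vec{p} \leftarrow G(\vec{x}_{t+1}, \vec{w})$ if $\vec{x}_{t+1} \notin \mathbb{T}$
6:   $Q_x \leftarrow \left(\frac{\partial U}{\partial \vec{x}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{x}}\right)_t \vec{p}$
7:   $Q_u \leftarrow \left(\frac{\partial U}{\partial \vec{u}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{u}}\right)_t \vec{p}$
8:   $\vec{e} \leftarrow Q_x + \left(\frac{\partial A}{\partial \vec{x}}\right)_t Q_u - G(\vec{x}_t, \vec{w})$
9:   $\vec{w} \leftarrow \vec{w} + \beta \left(\frac{\partial G}{\partial \vec{w}}\right)_t \vec{e}$ {Critic network update}
10:  $\vec{z} \leftarrow \vec{z} - \alpha \left(\frac{\partial A}{\partial \vec{z}}\right)_t Q_u$ {Action network update}
11:  $t \leftarrow t + 1$
12: end while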

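Algorithm 3 HDP with a critic network $V(\vec{x}, \vec{w})$ and
action network $A(\vec{x}, \vec{z})$.

1: $t \leftarrow 0$
2: while $\vec{x}_t \notin \mathbb{T}$ do
3:   $\vec{u}_t \leftarrow A(\vec{x}_t, \vec{z})$
4:   $\vec{x}_{t+1} \leftarrow f(\vec{x}_t, \vec{u}_t)$
5:   $\vec{p} \leftarrow \left(\frac{\partial \Phi}{\partial \vec{x}}\right)_{t+1}$ if $\vec{x}_{t+1} \in \mathbb{T}$; $\vec{p} \leftarrow \left(\frac{\partial V}{\partial \vec{x}}\right)_{t+1}$ if $\vec{x}_{t+1} \notin \mathbb{T}$
6:   $V_{t+1} \leftarrow \Phi(\vec{x}_{t+1})$ if $\vec{x}_{t+1} \in \mathbb{T}$; $V_{t+1} \leftarrow V(\vec{x}_{t+1}, \vec{w})$ if $\vec{x}_{t+1} \notin \mathbb{T}$
7:   $Q_u \leftarrow \left(\frac{\partial U}{\partial \vec{u}}\right)_t + \gamma \left(\frac{\partial f}{\partial \vec{u}}\right)_t \vec{p}$
8:   $\vec{w} \leftarrow \vec{w} + \beta \left(\frac{\partial V}{\partial \vec{w}}\right)_t \left( U(\vec{x}_t, \vec{u}_t) + \gamma V_{t+1} - V(\vec{x}_t, \vec{w}) \right)$ {Critic network update}
9:   $\vec{z} \leftarrow \vec{z} - \alpha \left(\frac{\partial A}{\partial \vec{z}}\right)_t Q_u$ {Action network update}
10:  $t \leftarrow t + 1$
11: end while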

terminal state is reached, i.e. for which $\vec{x}_t \notin \mathbb{T}$ and for which
$f(\vec{x}_t, \vec{u}_t) \in \mathbb{T}$.

These three clipped functions, $f^C$, $U^C$ and $J^C$, are key
concepts in this paper, because defining them clearly allows
us to differentiate them carefully, and hence calculate the
learning gradients correctly. This is what allows us to solve the
clipping problem. Hence this section is the main contribution
of this paper, in terms of implementation details for solving
the clipping problem.



A. Calculation of the Clipped Model and Cost Functions 

Suppose the agent is transitioning between states $\vec{x}_t$ and
$f(\vec{x}_t, \vec{u}_t)$, and the state $f(\vec{x}_t, \vec{u}_t)$ would be beyond the
terminal boundary unless clipping was applied. To calculate the
clipping correctly, we imagine this state transition as occurring
along the straight line segment from $\vec{x}_t$ to $f(\vec{x}_t, \vec{u}_t)$, i.e. the
Fig. 4: The final state transition of a trajectory crossing the
tangent plane of a terminal boundary. The unclipped line goes
from $\vec{x}_t$ to $f(\vec{x}_t, \vec{u}_t)$. The line intersects the plane at a point
given by the new clipped model function $f^C(\vec{x}_t, \vec{u}_t, \vec{P}, \vec{n})$.



straight line given parametrically by position vector

$$\vec{r} = \vec{x}_t + \lambda \vec{v}, \tag{9}$$

where

$$\vec{v} = f(\vec{x}_t, \vec{u}_t) - \vec{x}_t, \tag{10}$$

and $\lambda \in [0, 1]$ is a real parameter. This is illustrated in Fig. 4.
This straight line must intersect a boundary of terminal
states. At the point of intersection, the tangent plane of the
terminal boundary is given by $(\vec{r} - \vec{P}) \cdot \vec{n} = 0$ (i.e. where $\vec{r}$
is an arbitrary position vector that lies on a plane which has
normal $\vec{n}$ and passes through a point with position vector $\vec{P}$,
and where "$\cdot$" denotes the inner product between two vectors),
as illustrated in Fig. 4. The constants $\vec{P}$ and $\vec{n}$ should be
available from either the physical environment or from the
collision-detection routine of the simulated environment.
At the intersection of the line and the plane, we have

$$(\vec{x}_t + \lambda \vec{v} - \vec{P}) \cdot \vec{n} = 0 \quad \Rightarrow \quad \lambda = \frac{(\vec{P} - \vec{x}_t) \cdot \vec{n}}{\vec{v} \cdot \vec{n}}$$

This value of $\lambda$ is a real number between 0 and 1 which
indicates the fraction along the transition line from $\vec{x}_t$ to
$f(\vec{x}_t, \vec{u}_t)$ at which the terminal boundary was encountered. We
will refer to the value $\lambda$ as the "clipping fraction", and since
it depends on $\vec{x}_t$, $\vec{u}_t$, $\vec{P}$ and $\vec{n}$, it is defined by the function:



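$$\lambda := \Lambda(\vec{x}_t, \vec{u}_t, \vec{P}, \vec{n}) := \frac{(\vec{P} - \vec{x}_t) \cdot \vec{n}}{(f(\vec{x}_t, \vec{u}_t) - \vec{x}_t) \cdot \vec{n}} \tag{11}$$

Hence the clipped value of the final state is
$\vec{x}_{t+1} = \vec{x}_t + \Lambda(\vec{x}_t, \vec{u}_t, \vec{P}, \vec{n}) (f(\vec{x}_t, \vec{u}_t) - \vec{x}_t)$, which is found by combining
equations (10) and (11). This gives the clipped model function as

$$f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n}) := \vec{x} + \Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n}) (f(\vec{x}, \vec{u}) - \vec{x}). \tag{12}$$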

Assuming that "cost" is delivered at a uniform rate during
the final state transition, the total clipped cost would be
proportional to the clipping fraction, giving:

$$U^C(\vec{x}, \vec{u}, \vec{P}, \vec{n}) := \Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n}) \, U(\vec{x}, \vec{u}). \tag{13}$$

Since the final clipped time step has duration $\lambda \in [0, 1]$,
the terminal cost $\Phi(\vec{x}_T)$ should only receive a discount of $\gamma^\lambda$
instead of the full discount $\gamma$. Hence, at the penultimate time
step, $\vec{x}_{T-1}$, the total cost-to-go is

$$J^C(\vec{x}_{T-1}, \vec{z}) := U^C(\vec{x}_{T-1}, \vec{u}_{T-1}, \vec{P}, \vec{n}) + \gamma^\lambda \Phi(\vec{x}_T). \tag{14}$$

This may seem like a pedantic detail, but it is this detail
which allows us to solve a version of the cart-pole benchmark
problem, which would otherwise be impossible for DHP, in
Section IV-B.

Alg. 4 illustrates how equations (1)-(3) and (11)-(14) would
be used to evaluate a trajectory with clipping.

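Algorithm 4 Unrolling a trajectory with clipping.

1: $t \leftarrow 0$, $J^C \leftarrow 0$
2: while $\vec{x}_t \notin \mathbb{T}$ do
3:   $\vec{u}_t \leftarrow A(\vec{x}_t, \vec{z})$
4:   $\vec{x}_{t+1} \leftarrow f(\vec{x}_t, \vec{u}_t)$
5:   if $\vec{x}_{t+1} \in \mathbb{T}$ then
6:     Identify $\vec{P}$ and $\vec{n}$ by inspection of the intersection with the terminal boundary, $\mathbb{T}$.
7:     $\lambda \leftarrow \Lambda(\vec{x}_t, \vec{u}_t, \vec{P}, \vec{n})$
8:     $T \leftarrow t + 1$
9:     $\vec{x}_T \leftarrow \vec{x}_t + \lambda (\vec{x}_T - \vec{x}_t)$
10:    $J^C \leftarrow J^C + \gamma^t \left( \lambda U(\vec{x}_t, \vec{u}_t) + \gamma^\lambda \Phi(\vec{x}_T) \right)$
11:  else
12:    $J^C \leftarrow J^C + \gamma^t U(\vec{x}_t, \vec{u}_t)$
13:  end if
14:  $t \leftarrow t + 1$
15: end while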

Note that $\vec{P}$ and $\vec{n}$ are required by equations (11)-(13).
These would be found during the collision-detection routine
(i.e. line 6 of Alg. 4), from knowledge of the
terminal-boundary orientation, together with knowledge of $\vec{x}_{T-1}$ and
$f(\vec{x}_{T-1}, \vec{u}_{T-1})$. Knowledge of the orientation of the terminal
boundary could come from a model of the physical
environment's boundary; or if this model was not available, then a
physical inspection of the actual boundary would need to take
place. Examples of how these two vectors were found in our
experiments are given in Sec. IV-A and IV-B.
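The clipped unroll of equations (11)-(14) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: states and actions are plain lists of floats, the function names and the `find_plane` callback (which stands in for the collision-detection routine that supplies $\vec{P}$ and $\vec{n}$) are hypothetical, and, unlike the loop structure of Alg. 4, the sketch returns immediately after the clipped step so that floating-point rounding of the clipped state cannot trigger a second termination.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def clipping_fraction(x, x_next, P, n):
    """Eq. (11): fraction of the step x -> x_next at which the plane is met."""
    v = [b - a for a, b in zip(x, x_next)]                  # eq. (10)
    return dot([p - a for p, a in zip(P, x)], n) / dot(v, n)

def unroll_with_clipping(x0, policy, model, cost, terminal_cost,
                         is_terminal, find_plane, gamma=1.0, max_steps=10_000):
    """Return (clipped cost-to-go J^C, clipped final state x_T)."""
    x, total = list(x0), 0.0
    for t in range(max_steps):
        if is_terminal(x):
            break                                           # started in a terminal state
        u = policy(x)
        x_next = model(x, u)
        if is_terminal(x_next):
            P, n = find_plane(x, x_next)                    # hypothetical collision routine
            lam = clipping_fraction(x, x_next, P, n)
            x_T = [a + lam * (b - a) for a, b in zip(x, x_next)]   # eq. (12)
            # eqs. (13)-(14): partial step cost plus gamma**lam-discounted impulse
            total += gamma ** t * (lam * cost(x, u) + gamma ** lam * terminal_cost(x_T))
            return total, x_T
        total += gamma ** t * cost(x, u)
        x = x_next
    return total, x
```

For instance, an agent starting at the origin and moving by (0.4, 0.1) per step towards the plane $x^0 = 1$ is stopped halfway through its third step, at approximately (1.0, 0.25), instead of overshooting to (1.2, 0.3).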

B. Calculation of the Derivatives of the Clipped Model and 
Cost Functions 

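The ADP algorithms described in Section II require the
derivatives of the model function, and hence they will require
the derivatives of the clipped model function $f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$
too. Fig. 5 shows how different the derivative of $f^C$ can be
from the derivative of $f$, and hence how important it is to get
this correct in ADP/RL. This figure clarifies why algorithms
that are dependent on $\frac{\partial f}{\partial \vec{x}}$ are critically affected by the need
for clipping, and also that just reducing the duration of each
time step tracking or simulating the motion will not solve the
problem at all.

Fig. 5: This diagram shows how the derivatives of the model
function $f(\vec{x}, \vec{u})$ radically change as the agent approaches
a terminal boundary. The straight line segment from $\vec{x}_A$ to
$f(\vec{x}_A, \vec{u}_A)$ represents a state transition that is not
intersecting the terminal boundary. If the start of this line segment
is perturbed in the direction of the arrow $\Delta \vec{x}_A$ then its
other end will move in the direction indicated by the arrow
$\Delta f(\vec{x}_A, \vec{u}_A)$. The line segment below, however, which starts
at $\vec{x}_B$, does reach the terminal boundary. If the start of this
line segment is moved in the direction of $\Delta \vec{x}_B$, then its end
will move in a perpendicular direction, as indicated by the
arrow $\Delta f^C(\vec{x}_B, \vec{u}_B, \vec{P}, \vec{n})$. This indicates that $\frac{\partial f^C}{\partial \vec{x}}$ is very
different from $\frac{\partial f}{\partial \vec{x}}$, and hence this needs treating carefully
in the ADP algorithms.

Differentiating the formula for $\Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ in (11) gives:

$$\begin{aligned}
\frac{\partial \Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n})}{\partial \vec{x}}
&= \frac{\partial}{\partial \vec{x}} \left( \frac{(\vec{P} - \vec{x}) \cdot \vec{n}}{(f(\vec{x}, \vec{u}) - \vec{x}) \cdot \vec{n}} \right) && \text{by (11)} \\
&= \frac{-\vec{n}}{\vec{v} \cdot \vec{n}} - \frac{(\vec{P} - \vec{x}) \cdot \vec{n}}{(\vec{v} \cdot \vec{n})^2} \, \frac{\partial \big( (f(\vec{x}, \vec{u}) - \vec{x}) \cdot \vec{n} \big)}{\partial \vec{x}} && \text{using (10)} \\
&= \frac{-\vec{n}}{\vec{v} \cdot \vec{n}} - \frac{(\vec{P} - \vec{x}) \cdot \vec{n}}{(\vec{v} \cdot \vec{n})^2} \left( \frac{\partial f}{\partial \vec{x}} - I \right) \vec{n}
\end{aligned} \tag{15}$$

where $I$ is the identity matrix, and the matrix notation is as
defined in Section II-A. Similarly,

$$\begin{aligned}
\frac{\partial \Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n})}{\partial \vec{u}}
&= \frac{\partial}{\partial \vec{u}} \left( \frac{(\vec{P} - \vec{x}) \cdot \vec{n}}{(f(\vec{x}, \vec{u}) - \vec{x}) \cdot \vec{n}} \right) && \text{by (11)} \\
&= - \frac{(\vec{P} - \vec{x}) \cdot \vec{n}}{(\vec{v} \cdot \vec{n})^2} \, \frac{\partial \big( (f(\vec{x}, \vec{u}) - \vec{x}) \cdot \vec{n} \big)}{\partial \vec{u}} && \text{using (10)} \\
&= - \frac{(\vec{P} - \vec{x}) \cdot \vec{n}}{(\vec{v} \cdot \vec{n})^2} \left( \frac{\partial f}{\partial \vec{u}} \right) \vec{n}
\end{aligned} \tag{16}$$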
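Equations (15) and (16) can be checked numerically on a small example. The sketch below is an illustration only: it assumes a two-dimensional point-mass model (position and velocity, with a scalar acceleration as the action), and the names `model`, `df_dx`, `df_du` and `clipping_fraction_grads` are hypothetical. Jacobians are stored in the transposed convention of Section II-A, so `df_dx(x, u)[i][j]` holds $\partial f^j / \partial x^i$.

```python
DT = 1.0  # assumed time step for the toy point-mass model

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def model(x, u):
    """Toy model: x = (position, velocity), u = (acceleration,)."""
    return [x[0] + x[1] * DT, x[1] + u[0] * DT]

def df_dx(x, u):
    return [[1.0, 0.0], [DT, 1.0]]      # entry [i][j] = d f^j / d x^i

def df_du(x, u):
    return [[0.0, DT]]                  # entry [i][j] = d f^j / d u^i

def clipping_fraction(x, u, P, n, model):
    """Eq. (11)."""
    v = [fi - xi for fi, xi in zip(model(x, u), x)]
    return dot([pi - xi for pi, xi in zip(P, x)], n) / dot(v, n)

def clipping_fraction_grads(x, u, P, n, model, df_dx, df_du):
    """Eqs. (15) and (16): derivatives of the clipping fraction."""
    v = [fi - xi for fi, xi in zip(model(x, u), x)]
    vn = dot(v, n)
    a = dot([pi - xi for pi, xi in zip(P, x)], n)
    A = df_dx(x, u)
    A_minus_I = [[A[i][j] - (1.0 if i == j else 0.0) for j in range(len(x))]
                 for i in range(len(x))]
    dvn_dx = matvec(A_minus_I, n)       # (df/dx - I) n
    dlam_dx = [-n[i] / vn - a / vn ** 2 * dvn_dx[i] for i in range(len(x))]
    dvn_du = matvec(df_du(x, u), n)     # (df/du) n
    dlam_du = [-a / vn ** 2 * g for g in dvn_du]
    return dlam_dx, dlam_du
```

Comparing the two returned gradients against central differences of `clipping_fraction` is a quick way to catch sign or transposition mistakes before these derivatives are used inside a learning algorithm.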



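Using these derivatives of $\Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n})$, we can now
differentiate the clipped model and cost functions, giving:

$$\frac{\partial f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})}{\partial \vec{x}} = I + \frac{\partial \Lambda}{\partial \vec{x}} \vec{v}^T + \Lambda \left( \frac{\partial f}{\partial \vec{x}} - I \right) \qquad \text{by (10)-(12)} \tag{17}$$

$$\frac{\partial f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})}{\partial \vec{u}} = \frac{\partial \Lambda}{\partial \vec{u}} \vec{v}^T + \Lambda \frac{\partial f}{\partial \vec{u}} \qquad \text{by (10)-(12)} \tag{18}$$

$$\frac{\partial U^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})}{\partial \vec{x}} = \frac{\partial \Lambda}{\partial \vec{x}} U(\vec{x}, \vec{u}) + \Lambda \frac{\partial U}{\partial \vec{x}} \qquad \text{by (13)} \tag{19}$$

$$\frac{\partial U^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})}{\partial \vec{u}} = \frac{\partial \Lambda}{\partial \vec{u}} U(\vec{x}, \vec{u}) + \Lambda \frac{\partial U}{\partial \vec{u}} \qquad \text{by (13)} \tag{20}$$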



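The cost-to-go function for the penultimate time step,
equation (14), can be rewritten as a Q-function of both $\vec{x}$ and $\vec{u}$,
to give

$$Q(\vec{x}_{T-1}, \vec{u}_{T-1}) := U^C(\vec{x}_{T-1}, \vec{u}_{T-1}, \vec{P}, \vec{n}) + \gamma^\lambda \Phi(f^C(\vec{x}_{T-1}, \vec{u}_{T-1}, \vec{P}, \vec{n})). \tag{21}$$

Differentiating this with respect to $\vec{u}_{T-1}$ or $\vec{x}_{T-1}$ gives:

$$\left(\frac{\partial Q}{\partial \bullet}\right)_{T-1} = \left(\frac{\partial U^C}{\partial \bullet}\right)_{T-1} + \gamma^\lambda \left( \left(\frac{\partial f^C}{\partial \bullet}\right)_{T-1} \left(\frac{\partial \Phi}{\partial \vec{x}}\right)_T + (\ln \gamma) \left(\frac{\partial \Lambda}{\partial \bullet}\right)_{T-1} \Phi(\vec{x}_T) \right) \tag{22}$$

where $\bullet$ represents either $\vec{u}$ or $\vec{x}$.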

This equation, which relies upon the derivatives of
$f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ and $U^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ (as defined in equations
(17) to (20)), can be used to modify BPTT from Alg. 1 into
its corresponding "with clipping" version given in Alg. 5.



Equation ( 22 1 appears in the algorithm directly in lines pfl9^ 

The DHP and HDP algorithms need similar modifications
to convert them to include clipping. Clipping needs applying
to the final time step of the trajectory unroll, which can be
implemented by replacing line 4 of both algorithms by lines
4-13 of Alg. 4. Also, in the case of DHP (Alg. 2), the lines
that calculate $Q_x$ and $Q_u$ need replacing by the corresponding
lines of Alg. 5, and similarly the line that calculates $Q_u$ in Alg. 3 (HDP)
needs the same modification.

Algorithm 5 Backpropagation Through Time for Control, 
with Clipping. 

Require: Trajectory calculated by Alg. 4.
C. Implementing Clipping Efficiently and Correctly

To demonstrate how clipping would be correctly
implemented with an ADP/RL algorithm, we use the BPTT
algorithm for illustration. In an implementation of BPTT with
clipping, we would first evaluate a trajectory by Alg. 4. During
this stage, we would record the full trajectory $(\vec{x}_0, \vec{x}_1, \ldots, \vec{x}_T)$
and actions $(\vec{u}_0, \vec{u}_1, \ldots, \vec{u}_{T-1})$ and also, during the collision
with the terminal boundary, we would record $\vec{P}$ and $\vec{n}$ and the
clipping fraction, $\lambda$. We then have enough information to be
able to run the BPTT algorithm with clipping (Alg. 5).

To ensure the correctness of our implementations in each
experiment and environment which we tackled, we first
verified all of the derivatives of $\Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n})$, $f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ and
$U^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ numerically, with respect to both $\vec{x}$ and $\vec{u}$,
at least a few times. When all of these derivatives were
satisfactorily programmed and checked, we then checked by
numerical differentiation that the overall BPTT
implementation was calculating the derivative correctly.

For an example of the numerical differentiations used,
the final check of BPTT was done by a central-differences
numerical derivative for each component $i$ of the weight vector
$\vec{z}$, to verify that

$$\frac{\partial J^C}{\partial z^i} = \frac{J^C(\vec{x}_0, \vec{z} + \epsilon \vec{e}_i) - J^C(\vec{x}_0, \vec{z} - \epsilon \vec{e}_i)}{2 \epsilon} + O(\epsilon^2)$$

where $\epsilon$ is a small positive constant, and $\vec{e}_i$ is the $i$th Euclidean
standard basis vector. In this verification equation, each $J^C(\cdot)$
term appearing in the right-hand side would be computed by
executing Alg. 4 from the trajectory start point $\vec{x}_0$, and the
theoretical value of $\frac{\partial J^C}{\partial z^i}$ appearing in the left-hand side would
be computed by Alg. 5.
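The central-differences check described above can be written as a small generic utility. This is a sketch only: `numerical_gradient` and `gradients_agree` are hypothetical helper names, and `fn` stands for any scalar function of the weight vector, such as a routine that unrolls a clipped trajectory and returns $J^C$.

```python
def numerical_gradient(fn, z, eps=1e-5):
    """Central-difference estimate of d fn / d z for a list of floats z."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps                 # z + eps * e_i
        z_minus[i] -= eps                # z - eps * e_i
        grad.append((fn(z_plus) - fn(z_minus)) / (2 * eps))
    return grad

def gradients_agree(analytic, numeric, tol=1e-6):
    """True if every component matches to a relative tolerance tol."""
    return all(abs(a - b) <= tol * max(1.0, abs(a), abs(b))
               for a, b in zip(analytic, numeric))
```

In use, the analytic gradient from a BPTT routine would be passed as `analytic` and the output of `numerical_gradient` as `numeric`. The choice of `eps` and `tol` here is arbitrary and would normally be tuned to the scale of the cost function.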

In HDP and DHP, the derivatives of $\Lambda(\vec{x}, \vec{u}, \vec{P}, \vec{n})$,
$f^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ and $U^C(\vec{x}, \vec{u}, \vec{P}, \vec{n})$ would be calculated and
verified as above. However with HDP and DHP it is more
difficult to check the overall critic weight updates numerically,
since they are not true gradient descent on any analytic
function [22]. For these algorithms, it was possible to verify
the key algorithmic modifications related to clipping, by just
checking the derivatives of the Q-function given by (22). These
derivatives were compared to the numerical derivatives of (21)
with respect to $\vec{x}$ and $\vec{u}$.

D. Clipping with Trajectories of Fixed or Variable Finite 
Length 

In situations where trajectories are fixed finite length (com- 
monly referred to as a fixed-length finite-horizon problem), 
clipping is not necessary. This is in contrast to the problems 
we considered in the introduction, which were variable finite- 
length problems, since the trajectory lengths were determined 
by the environment (e.g. a trajectory terminates only when 
the agent crashes into a wall). In this section we will dis- 
tinguish between these two situations by referring to them 
as "fixed finite-length" and "variable finite-length" problems, 
respectively. Only in variable finite-length problems is clipping 
necessary. 

In the fixed finite-length problem, the clipping fraction
defined by (11) is always $\lambda = 1$, and therefore $\frac{\partial \Lambda}{\partial \vec{x}} = 0$,
$\frac{\partial \Lambda}{\partial \vec{u}} = 0$ and $\gamma^\lambda = \gamma$. Hence the clipped model and cost
functions are identical to their unclipped counterparts, and
therefore it is not necessary to implement any program code
specifically to handle clipping. This might be one reason
why the need for clipping has not previously been noted
in the research literature, since most finite-horizon problems
considered have been fixed finite-length.

However, the fixed finite-length problem does have one
minor additional complication, in that it is often necessary to
include the time step in the state vector. This is because
the optimal actions and cost-to-go function will often be
dependent upon the number of steps remaining in a trajectory.

Of course for both fixed-length and variable-length finite- 
horizon problems, it is important to ensure the terminal cost 
function is learnt correctly by the learning algorithm. 

The pseudocode shows explicitly how to do this (e.g. for
BPTT, see line 2 of Algs. 1 and 5. For DHP, see line 5 of
Alg. 2. And for HDP, see lines 5 and 6 of Alg. 3).

IV. Experimental Results 

This section describes two neural-network based ADP/RL 
experiments which require clipping to be solved well. 

In all experiments the action and critic networks used were 
multi-layer perceptrons (MLPs, see ||23ll for details). Each 
MLP had dim(x) input nodes, 2 hidden layers of 6 nodes each, 
and one output layer, with short-cut connections connecting 
all pairs of layers. The output layers were dimensioned as 
follows: Each action network had dim(u) output nodes; each 
HDP critic network had 1 output node; and each DHP critic 
had dim(x) output nodes. All network nodes had bias weights, 
as is usual in MLP architectures. The activation functions 
used were hyperbolic tangent functions, except for the critic 
network's output layer which was always a linear activation 
function (with linear slope as specified in the individual 
experiments, below). At the start of each experimental trial, 
neural weights were initialised randomly in the range [—.1, .1], 
with uniform probability distribution. 
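
The architecture described above can be sketched in plain Python as follows. This is an illustrative reading of the description (two hidden layers of 6 tanh nodes, short-cut connections between every pair of layers, bias weights, uniform initialisation in [-0.1, 0.1]); the function names `init_mlp` and `forward` and the dictionary layout are assumptions made for this sketch only.

```python
import math
import random


def init_mlp(n_in, n_out, n_hidden=6, rng=None):
    """Create weights for layers [input, hidden, hidden, output] with
    short-cut connections from every earlier layer to every later layer."""
    rng = rng or random.Random(0)
    sizes = [n_in, n_hidden, n_hidden, n_out]
    weights = {}
    for dst in range(1, len(sizes)):
        for src in range(dst):
            weights[(src, dst)] = [
                [rng.uniform(-0.1, 0.1) for _ in range(sizes[src])]
                for _ in range(sizes[dst])
            ]
    biases = {
        dst: [rng.uniform(-0.1, 0.1) for _ in range(sizes[dst])]
        for dst in range(1, len(sizes))
    }
    return {"sizes": sizes, "weights": weights, "biases": biases}


def forward(net, x, output_slope=None):
    """Forward pass. output_slope=None gives a tanh output layer (action
    network); a number gives a linear output with that slope (critic)."""
    activations = [list(x)]
    last = len(net["sizes"]) - 1
    for dst in range(1, last + 1):
        layer = []
        for j in range(net["sizes"][dst]):
            total = net["biases"][dst][j]
            for src in range(dst):  # sum over all earlier layers
                row = net["weights"][(src, dst)][j]
                total += sum(w * a for w, a in zip(row, activations[src]))
            if dst == last and output_slope is not None:
                layer.append(output_slope * total)
            else:
                layer.append(math.tanh(total))
        activations.append(layer)
    return activations[-1]
```

For example, `forward(init_mlp(3, 1), (0.5, -0.2, 0.6))` evaluates an action network for a three-dimensional state, while passing `output_slope=20.0` evaluates the same weights as a critic with a linear output layer.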

A. Vertical Lander problem 

A spacecraft is dropped in a uniform gravitational field, 
and its objective is to make a fuel-efficient gentle landing. 
The spacecraft is constrained to move in a vertical line, and a 
single thruster is available to make upward accelerations. The 
state vector $\mathbf{x} = (h, v, u)^T$ has three components: height ($h$), 
velocity ($v$), and fuel remaining ($u$). The action is 
one-dimensional (a scalar $a \in \mathbb{R}$), producing accelerations 
$a \in [0,1]$. The Euler method with time-step $\Delta t$ is used to 
integrate the motion, giving model functions: 

$$f((h, v, u)^T, a) = \left(h + v\Delta t,\; v + (a - k_g)\Delta t,\; (k_u)u - a\Delta t\right)^T$$
$$U((h, v, u)^T, a) = (k_f)\,a\,\Delta t \tag{23}$$

Here, $k_g = 0.2$ is a constant giving the acceleration due to 
gravity (so the spacecraft can produce greater acceleration than 
that due to gravity); $k_f = 4$ is a constant giving the fuel penalty; 
and $k_u = 1$ is a unit conversion constant. We used $\Delta t = 1$ in our 
main experiments here. 
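
The model and cost functions of (23) translate directly into code. The following Python sketch uses the constants quoted above; the function names `lander_step` and `lander_cost` are chosen here for illustration.

```python
K_G = 0.2   # acceleration due to gravity
K_F = 4.0   # fuel penalty constant
K_U = 1.0   # unit conversion constant
DT = 1.0    # Euler time step used in the main experiments


def lander_step(state, a):
    """One Euler step of the vertical lander: state = (h, v, u), action a in [0, 1]."""
    h, v, u = state
    return (h + v * DT, v + (a - K_G) * DT, K_U * u - a * DT)


def lander_cost(state, a):
    """Per-step cost U(x, a): a fuel penalty proportional to the thrust used."""
    return K_F * a * DT
```

For instance, from the state (10, -1, 30) with thrust a = 0.5, one step moves the lander to height 9, changes the velocity by (0.5 - 0.2), and consumes 0.5 units of fuel.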

Trajectories terminate as soon as the spacecraft hits the 
ground ($h = 0$) or runs out of fuel ($u = 0$). These two 
conditions define $\mathbb{T}$. This is a variable finite-length problem, 
and there is no need to use a discount factor, so we fixed $\gamma = 1$. 
On termination, the algorithms need to choose values for $P$ 
and $\mathbf{n}$, which describe the orientation of the terminal-boundary 
tangent plane. These choices are given for this experiment in 
Table I. In the case that the final unclipped state transition 
crosses both terminal planes, the one that is crossed first 
(i.e. the one that produces the smaller clipping fraction, as 
computed by the clipping-fraction equation) is to be used. 
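
The choice between two crossed planes can be sketched as below. The sketch assumes straight-line motion within the final step, so that the clipping fraction for a plane through point P with normal n is n·(P - x_t) / n·(x_{t+1} - x_t). The two planes used here (h = 0 and u = 0, each through the origin) are an assumed reading of the termination conditions rather than values copied from Table I, and all function names are hypothetical.

```python
def clipping_fraction(x, x_next, point, normal):
    """Fraction of the step x -> x_next at which the plane is crossed, or None."""
    to_plane = sum(n * (p - xi) for n, p, xi in zip(normal, point, x))
    step = sum(n * (xn - xi) for n, xn, xi in zip(normal, x_next, x))
    if step == 0.0:
        return None  # motion is parallel to the plane
    s = to_plane / step
    return s if 0.0 <= s <= 1.0 else None


def clip_final_step(x, x_next, planes):
    """Return (fraction, plane name, clipped state) for the plane crossed first."""
    best = None
    for name, (point, normal) in planes.items():
        s = clipping_fraction(x, x_next, point, normal)
        if s is not None and (best is None or s < best[0]):
            best = (s, name)
    if best is None:
        return None
    s, name = best
    clipped = tuple(xi + s * (xn - xi) for xi, xn in zip(x, x_next))
    return s, name, clipped


# Assumed planes for the lander state (h, v, u): ground h = 0 and empty tank u = 0.
PLANES = {
    "ground": ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    "fuel": ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
}

result = clip_final_step((1.0, -2.0, 0.3), (-1.0, -2.2, -0.2), PLANES)
```

In this example both planes are crossed in the same step; the ground plane is reached halfway through the step, before the fuel runs out, so it determines the clipped terminal state.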

In addition to the cost function $U(\mathbf{x}, a)$ defined above, a 
final impulse of cost defined by $\Phi(\mathbf{x}) := \frac{1}{2} m v^2 + m (k_g) h$ 
is given as soon as the lander reaches a terminal state, where 
$m = 2$ is the mass of the spacecraft. The two terms in the 
final impulse of cost are the kinetic and potential energy, 
respectively. The first cost term penalises landing too quickly. 
The second term is a cost term equivalent to the kinetic energy 
that the spacecraft would acquire by crashing to the ground 
under free fall (i.e. with a = 0), so to minimise this cost the 
spacecraft must learn to not run out of fuel. 
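
The equivalence between the potential-energy term and the free-fall kinetic energy can be checked numerically. The short Python sketch below is illustrative only; `terminal_cost` and `free_fall_impact_speed` are names chosen here, and the impact speed uses the continuous-time relation v² = 2 k_g h rather than the Euler-integrated motion.

```python
import math

MASS = 2.0  # mass of the spacecraft, m
K_G = 0.2   # acceleration due to gravity


def terminal_cost(h, v):
    """Final impulse of cost: kinetic energy plus potential energy."""
    return 0.5 * MASS * v ** 2 + MASS * K_G * h


def free_fall_impact_speed(h):
    """Speed reached on the ground after falling from rest at height h with a = 0."""
    return math.sqrt(2.0 * K_G * h)


# Running out of fuel at rest at height 10 costs the same as
# hitting the ground at the corresponding free-fall speed.
cost_out_of_fuel = terminal_cost(10.0, 0.0)
cost_crash = terminal_cost(0.0, free_fall_impact_speed(10.0))
```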

The input vector to the action and critic networks was 
$\mathbf{x}' = (h/100, v/10, u/50)^T$, and the model and cost functions 
were redefined to act on this rescaled input vector directly. 
The action network's output $y$ was rescaled to give the action 
directly, by $A(\mathbf{x}, \mathbf{z}) := (y + 1)/2$. We tested each algorithm 
in batch mode, operating on five trajectories simultaneously. 
Those five trajectories had fixed start points, which had been 
randomly chosen in the region $h \in (0,100)$, $v \in (-10,10)$ 
and $u = 30$. 
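
A minimal sketch of this rescaling and of the start-point sampling is given below. The function names are hypothetical, the height divisor of 100 is taken as the reading of the scaling above, and the seeded generator is an assumption used only to make the five start points reproducible.

```python
import random


def rescale_state(h, v, u):
    """Rescale the lander state for input to the action and critic networks."""
    return (h / 100.0, v / 10.0, u / 50.0)


def action_from_output(y):
    """Map the action network's tanh output y in (-1, 1) to a thrust in (0, 1)."""
    return (y + 1.0) / 2.0


def sample_start_points(n, rng):
    """Draw n fixed start points with h in (0, 100), v in (-10, 10) and u = 30."""
    return [(rng.uniform(0.0, 100.0), rng.uniform(-10.0, 10.0), 30.0)
            for _ in range(n)]


start_points = sample_start_points(5, random.Random(0))
```

Because the action network's output layer is a hyperbolic tangent, `action_from_output` keeps every thrust inside the permitted range without any further clamping.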

Fig. 6 shows learning performance of the BPTT, DHP 
and HDP algorithms, both with and without clipping. Each 
graph shows five curves, and each curve shows the learning 
performance from a different random weight initialisation. The 
learning rates for the three algorithms were: BPTT ($\alpha = 0.01$); 
DHP ($\alpha = 0.001$, $\beta = 0.00001$); and HDP ($\alpha = 0.00001$, 
$\beta = 0.00001$). The critic network's output-layer activation 
function had a linear slope of 20 in the DHP experiment and 
10 in the HDP experiment. 

Because HDP is an algorithm which requires stochastic 
exploration to optimise the ADP/RL problem effectively [24], 
in the HDP experiment we had to modify (3) to choose 
exploratory actions. Hence for the HDP experiment we used 

$$\mathbf{u}_t = A(\mathbf{x}_t, \mathbf{z}) + X_\sigma,$$

where $X_\sigma$ is a normally distributed random variable with mean 
zero and standard deviation $\sigma = 0.1$. 
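
The exploratory action rule can be sketched in a few lines of Python. The function name `exploratory_action` and the seeded generator are assumptions for illustration; whether the noisy action was subsequently restricted to the valid thrust range is not stated, so no such restriction is applied here.

```python
import random

SIGMA = 0.1  # standard deviation of the exploration noise


def exploratory_action(greedy_action, rng):
    """Add zero-mean Gaussian noise to the action network's chosen action."""
    return greedy_action + rng.gauss(0.0, SIGMA)


rng = random.Random(0)
samples = [exploratory_action(0.5, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

Over many draws the sample mean stays close to the greedy action and the sample standard deviation stays close to 0.1, which is the behaviour the rule is meant to have.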

These graphs show the clear stability and performance 
advantages of using clipping correctly for the BPTT and DHP 
algorithms. The graphs also confirm that the HDP algorithm 
is not significantly affected by the need for clipping. 

Fig. 7 shows that the need for clipping is not made arbitrarily 
small by just using a smaller $\Delta t$ value. 

Fig. 6: Vertical Lander solutions by BPTT, DHP and HDP 
using $\Delta t = 1$. 

Fig. 7: Vertical Lander with $\Delta t = 0.01$. 



TABLE I: Terminal Boundary Planes used in vertical-lander 
experiment. The state vector used here is $\mathbf{x} = (h, v, u)^T$. 

B. Cart Pole Experiment 

We investigated the effects of clipping in the well known 
cart-pole benchmark problem described in Fig. 8. We considered 
the version of this problem used by [25], where the 
total trajectory cost is a function of the duration for which the 
pole could be balanced. Clearly, unless clipping is used 
properly, the duration will be an integer number of time steps, 
and since this is not smooth and differentiable, it will make 
learning problematic, or even impossible, for DHP and BPTT. Hence 
traditionally when DHP or BPTT are used for the cart-pole 
problem, a different cost function would be used, one that 
is differentiable and proportional to the deviation from the 
balanced position (e.g. see [26]). However in this section we 
show that by using clipping, DHP and BPTT can be successful 
with the duration-based reward. Since it is not possible to do 
this without clipping, we assume this is the first published 
version of this solution by DHP/BPTT. 

Fig. 8: Cart-pole benchmark problem. A pole with a pivot 
at its base is balancing on a cart. The objective is to apply 
a changing horizontal force $F$ to the cart which will move 
the cart backwards and forwards so as to balance the pole 
vertically. State variables are pole angle, $\theta$, and cart position, 
$x$, plus their derivatives with respect to time, $\dot{\theta}$ and $\dot{x}$. 

The equation of motion for the frictionless cart-pole system is: 

$$\ddot{\theta} = \frac{g \sin\theta - \cos\theta \left[\dfrac{F + m l \dot{\theta}^2 \sin\theta}{m_c + m}\right]}{l \left[\dfrac{4}{3} - \dfrac{m \cos^2\theta}{m_c + m}\right]} \tag{24}$$

where gravitational acceleration, $g = 9.8\,\mathrm{m\,s^{-2}}$; cart's mass, 
$m_c = 1\,\mathrm{kg}$; pole's mass, $m = 0.1\,\mathrm{kg}$; half pole length, 
$l = 0.5\,\mathrm{m}$; $F \in [-10, 10]$ is the force applied to the cart, 
in Newtons; and the pole angle, $\theta$, is measured in radians. 
The motion was integrated using the Euler method with a time 
step $\Delta t = 0.02$, which, for a state vector $\mathbf{x} = (x, \theta, \dot{x}, \dot{\theta})^T$, 
gives a model function $f(\mathbf{x}, \mathbf{u}) = \mathbf{x} + (\dot{x}, \dot{\theta}, \ddot{x}, \ddot{\theta})^T \Delta t$. 

The pole motion continues until it reaches a terminal state 
or until the pole is successfully balanced for 300 time steps, 
i.e. 6 seconds of real time. Terminal states ($\mathbb{T}$) are defined to 
be any state with $|x| > 2.4$, or $|\theta| > \pi/15$ (i.e. 12 degrees), or 
$t > 300$. Termination plane constants are given in Table II. 

TABLE II: Terminal Boundary Planes used in cart-pole experiment. 
The state vector used here is $\mathbf{x} = (x, \dot{x}, \theta, \dot{\theta}, t)^T$. 

The duration-based cost function of [25] is defined as 

$$U(\mathbf{x}, \mathbf{u}) = \begin{cases} 1 & \text{if } \mathbf{x} \in \mathbb{T} \text{ and } t < 300 \\ 0 & \text{if } \mathbf{x} \notin \mathbb{T} \text{ or } t = 300 \end{cases} \tag{26}$$

and when this is combined with a discount factor $\gamma < 1$, it 
gives a total trajectory cost of $J(\mathbf{x}_0, \mathbf{z}) = \gamma^T$, where $T$ 
is the trajectory duration. Since this function decreases with 
$T$, minimising it will increase $T$, i.e. lead to successful pole 
balancing. 

We tested the two algorithms BPTT and DHP on this 
problem with a discount factor $\gamma = 0.97$. To ensure 
the state vector was suitably scaled for input to both 
MLPs, we used rescaled state vectors $\mathbf{x}'$ defined by $\mathbf{x}' = 
(0.16x, 15\theta/\pi, \dot{x}, 4\dot{\theta})^T$, with $\theta$ in radians, throughout the 
implementation. As noted by [26], choosing an appropriate 
state-space scaling can be critical to successful convergence of 
actor-critic architectures in the cart-pole problem. The output 
of the action network, $y$, was multiplied by 10 to give the 
control force $F = A(\mathbf{x}, \mathbf{z}) = 10y$. The learning rates for 
the algorithms that we used were: BPTT ($\alpha = 0.1$); DHP 
($\alpha$ = 0M, $\beta = 0.0001$). The DHP critic used a final-layer 
activation-function slope of 0.1. 

Learning took place on five trajectories simultaneously, with 
fixed start points randomly chosen from the region $|x| < 2.4$, 
$|\theta| < \pi/15$, $\dot{x} = 0$, $\dot{\theta} = 0$. This is similar to the starting 
conditions used by [25]. The exact derivatives of the model 
and cost functions were made available to the algorithms. 

The performance of the two algorithms, both with and 
without clipping, is shown in Fig. 9. Each graph shows 
the average balancing duration over all five trajectories versus 
the training iteration. Each graph shows an ensemble of five 
different curves, with each curve representing a training run from 
a different random weight initialisation. 

Fig. 9: Cart-pole solutions by BPTT and DHP. 
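
The cart-pole model used in this experiment can be sketched in Python as follows, using the constants quoted in this section and Euler integration. The pole acceleration follows the equation of motion given for the pole angle; the cart acceleration is taken from the standard frictionless cart-pole formulation, which is an assumption of this sketch, as are the function names and the 12-degree angle limit written as pi/15. The duration counted here is the unclipped, whole-step duration.

```python
import math

G = 9.8            # gravitational acceleration (m/s^2)
M_CART = 1.0       # cart mass m_c (kg)
M_POLE = 0.1       # pole mass m (kg)
L_HALF = 0.5       # half pole length l (m)
DT = 0.02          # Euler time step (s)
X_LIMIT = 2.4      # terminal cart position
THETA_LIMIT = math.pi / 15   # terminal pole angle, 12 degrees
MAX_STEPS = 300    # 6 seconds of real time


def accelerations(theta, theta_dot, force):
    """Return (x_acc, theta_acc) for the frictionless cart-pole."""
    total_mass = M_CART + M_POLE
    temp = (force + M_POLE * L_HALF * theta_dot ** 2 * math.sin(theta)) / total_mass
    theta_acc = (G * math.sin(theta) - math.cos(theta) * temp) / (
        L_HALF * (4.0 / 3.0 - M_POLE * math.cos(theta) ** 2 / total_mass))
    # ASSUMPTION: standard frictionless form of the cart acceleration.
    x_acc = temp - M_POLE * L_HALF * theta_acc * math.cos(theta) / total_mass
    return x_acc, theta_acc


def euler_step(state, force):
    """One Euler step for the state (x, theta, x_dot, theta_dot)."""
    x, theta, x_dot, theta_dot = state
    x_acc, theta_acc = accelerations(theta, theta_dot, force)
    return (x + x_dot * DT, theta + theta_dot * DT,
            x_dot + x_acc * DT, theta_dot + theta_acc * DT)


def unclipped_duration(state, policy):
    """Count whole Euler steps until a terminal state is reached (no clipping)."""
    for t in range(MAX_STEPS):
        x, theta, _, _ = state
        if abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT:
            return t
        state = euler_step(state, policy(state))
    return MAX_STEPS


def zero_force(state):
    """Trivial policy used only for illustration."""
    return 0.0


steps_when_tilted = unclipped_duration((0.0, 0.05, 0.0, 0.0), zero_force)
steps_when_upright = unclipped_duration((0.0, 0.0, 0.0, 0.0), zero_force)
```

With no control force, a pole started slightly off vertical falls within a few dozen steps, whereas the exactly upright pole is an (unstable) equilibrium and survives the full 300 steps.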

The results show that using clipping correctly enables 
both the DHP and BPTT algorithms to solve this problem 
consistently, whereas without clipping neither algorithm 
can solve it. 
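
The reason a duration-based cost defeats gradient-based methods without clipping can be illustrated with a deliberately simple toy problem, which is not the cart-pole itself: a point moves at constant speed towards a boundary, and the cost is the discount factor raised to the power of the duration. All names and the toy dynamics below are assumptions made for this illustration.

```python
GAMMA = 0.97   # discount factor, as in the cart-pole experiment
DT = 0.02      # time step
BOUNDARY = 1.0


def unclipped_duration(x0, speed):
    """Whole number of steps taken before the state reaches the boundary."""
    steps, x = 0, x0
    while x < BOUNDARY:
        x += speed * DT
        steps += 1
    return float(steps)


def clipped_duration(x0, speed):
    """Duration with the final step truncated exactly at the boundary."""
    steps, x = 0, x0
    while True:
        x_next = x + speed * DT
        if x_next >= BOUNDARY:
            fraction = (BOUNDARY - x) / (x_next - x)  # clipping fraction
            return steps + fraction
        x = x_next
        steps += 1


def cost(duration):
    """Total trajectory cost for a duration-based cost with discounting."""
    return GAMMA ** duration


def finite_difference(duration_fn, x0, speed, eps=1e-6):
    """Central-difference estimate of d(cost)/d(x0)."""
    upper = cost(duration_fn(x0 + eps, speed))
    lower = cost(duration_fn(x0 - eps, speed))
    return (upper - lower) / (2.0 * eps)


grad_unclipped = finite_difference(unclipped_duration, 0.005, 1.0)
grad_clipped = finite_difference(clipped_duration, 0.005, 1.0)
```

Without clipping, the duration is piecewise constant in the start position, so the estimated gradient is zero almost everywhere and carries no learning signal; with clipping, the duration varies smoothly and the gradient is non-zero.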

V. Conclusions 

The problem of clipping for ADP/RL and neurocontrol 
algorithms has been demonstrated and motivated. Without 
clipping, algorithms which rely on the derivatives of the model 
and cost functions can fail to work. The solution is to apply 
clipping, and then to correctly differentiate the model and cost 
functions in the final time step. This solution has been given 
both as equations and as pseudocode 
for the two major affected ADP algorithms: DHP and BPTT. 

Two neural network experiments have confirmed the importance 
of applying clipping correctly. These were a cart-pole 
experiment, where clipping was found to be essential, 
and a vertical-lander experiment, where clipping produced 
a significant improvement in performance. 

The situations in which clipping is needed have been made 
clear, and those situations where it can be ignored have also 
been specified. 